Getting personal with epigenetics: towards individual-specific epigenomic imputation with machine learning

Epigenetic modifications are dynamic mechanisms involved in the regulation of gene expression. Unlike the DNA sequence, epigenetic patterns vary not only between individuals, but also between different cell types within an individual. Environmental factors, somatic mutations and ageing contribute to epigenetic changes that may constitute early hallmarks or causal factors of disease. Epigenetic modifications are reversible and thus promising therapeutic targets for precision medicine. However, mapping efforts to determine an individual’s cell-type-specific epigenome are constrained by experimental costs and tissue accessibility. To address these challenges, we developed eDICE, an attention-based deep learning model that is trained to impute missing epigenomic tracks by conditioning on observed tracks. Using a recently published set of epigenomes from four individual donors, we show that transfer learning across individuals allows eDICE to successfully predict individual-specific epigenetic variation even in tissues that are unmapped in a given donor. These results highlight the potential of machine learning-based imputation methods to advance personalized epigenomics.


[Table: per-metric model comparison; metric columns include Global MSE and Fg MSE.] In bold, we highlighted the best performance for each metric. For each performance metric, we checked whether any model other than the best one failed to reject the null hypothesis of a shared mean in a two-sided paired t-test with a significance level of 0.01 (N=203); no comparable performance was found.

Supplementary Table 3: EN-TEx accession codes. Accession codes for the 116 selected tracks from the EN-TEx dataset used in the imputation of personalized epigenomes. The measurements for "thoracic aorta" and "ascending aorta" were merged into a single tissue label, "aorta".

Evaluation Strategy
Measuring the performance of an imputation model is far from trivial. Properly designing an evaluation strategy is crucial to ensure that the results reflect the workings of the models and that the comparison to a selected set of baselines aligns with the use case intended for the model. Ernst and Kellis [16] validated the performance of ChromImpute with a leave-one-out (LOO) strategy, where, for each target test track, all remaining tracks are used for training. This strategy maximizes the amount of training data available for each prediction, but it may not give the best assessment of the method's generalization ability [2]. In contrast, the authors of PREDICTD and Avocado analysed the imputed values for a hold-out set of test tracks using 5-fold cross-validation. This procedure was also adopted in a recent ENCODE Imputation Challenge [47], and we do the same in this work. The split of the data is identical to one of the folds used in [19]. Supplementary Figure 1 depicts a data matrix showing which tracks from the Roadmap consortium were part of the training, validation and test sets. Quality control by the Roadmap Consortium guarantees a minimum standard for all data sets [15].
In previous work, including the ENCODE Imputation Challenge, training, validation and test tracks are processed following established computational pipelines. These pipelines include a pooling step for biological and technical replicates and subsampling to a maximum library size. Notably, the MACS2 peak caller is applied to calculate the statistical significance of enrichment at each base pair in the genome. The resulting genome-wide signal tracks containing the statistical significance of enrichment (i.e., the −log10 p-values) at each base pair have been used both as input training data and to validate imputations [19]. Recently, however, it has been demonstrated that MACS2 outputs biased p-values and false discovery rate estimates that can be many orders of magnitude too optimistic [5]. This systematic bias raises several problems; for instance, there is no sound basis for choosing a p-value cutoff for reporting results. Additionally, as the introduced bias depends on the ChIP-seq data from which the p-values are computed, there can be "distributional shifts" between training and test sets, as observed in the ENCODE Imputation Challenge.
To reduce the impact of possible systematic biases, we employ a varied set of performance metrics, each aiming to capture a different facet of the imputation task. For example, comparing models using area-under-the-curve (AUC) metrics ensures that the choice of a specific p-value cutoff is not the determining factor in ranking the imputation models.

Performance Metrics
An appropriate choice of metrics is essential for robust comparisons between imputation models. The most immediate and intuitive approach from a machine learning point of view is to measure the ability of a model to reconstruct genome-wide signals (i.e., the corrected −log10 p-values). Commonly used measures include the genome-wide mean squared error (GW MSE) and the genome-wide Pearson correlation (GW Corr) [19]. Importantly, however, most histone modifications are significantly enriched on only a small subset of genomic bins. The remaining bins carry low background signals, which are strongly determined by experimental conditions, including unspecific binding of the antibody. As these background regions are over-represented across the genome, they easily dominate genome-wide assessment measures.
Durham et al. [18] therefore defined additional performance measures, including MSE1obs, the MSE on the top 1% of positions ranked by the observed ChIP-seq signal, and MSE1imp, the MSE on the top 1% of positions ranked by the imputed signal. Outside the imputation literature, on the other hand, it is a common strategy to analyse histone modification measurements by focusing on foreground regions identified with peak callers such as MACS2 [22]. In this case, significantly enriched regions are merged if they are close together, and fluctuations below a given threshold are tolerated within a foreground peak if they are small enough. Additionally, regions of significant enrichment must span a minimum length (at least the fragment length) to be called foreground peaks. Peak callers therefore provide a more robust segmentation of the genome into foreground and background regions. Here, we follow a hierarchical analysis to evaluate the imputations: first, we test the ability of the model to correctly classify genomic regions into foreground and background (e.g. [8]). The signal reconstruction of the models is then analysed separately for bins categorized as foreground and background. For the foreground regions, we are additionally interested in how well the peak intensities are predicted (e.g. [9]) and whether the shape of the signal in the peak is well captured by the prediction (e.g. [10,24]). Table 2 presents the numerical results for the imputation of the test tracks of the Roadmap dataset. For each metric, we highlighted the best-performing model, as well as the competitors for which a two-sided paired t-test failed to reject the null hypothesis at a significance threshold of 0.01 (N=203).
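The following minimal Python sketch (with toy data; variable and function names are ours, not taken from the evaluation code used in this work) illustrates how these reconstruction metrics can be computed from paired observed and imputed −log10 p-value tracks:

```python
import numpy as np
from scipy.stats import pearsonr

def gw_mse(obs, imp):
    """Genome-wide mean squared error between observed and imputed tracks."""
    return float(np.mean((obs - imp) ** 2))

def gw_corr(obs, imp):
    """Genome-wide Pearson correlation."""
    return float(pearsonr(obs, imp)[0])

def mse_top1(obs, imp, rank_by):
    """MSE restricted to the top 1% of bins ranked by `rank_by`:
    rank_by=obs gives MSE1obs, rank_by=imp gives MSE1imp."""
    n_top = max(1, int(0.01 * rank_by.size))
    idx = np.argsort(rank_by)[-n_top:]  # bins with the largest signal
    return float(np.mean((obs[idx] - imp[idx]) ** 2))

# Toy example: the observed track plus noise plays the role of an imputation
rng = np.random.default_rng(0)
obs = rng.gamma(shape=2.0, scale=1.0, size=100_000)
imp = obs + rng.normal(0.0, 0.5, size=obs.size)
print(gw_mse(obs, imp), gw_corr(obs, imp))
print(mse_top1(obs, imp, rank_by=obs))  # MSE1obs
print(mse_top1(obs, imp, rank_by=imp))  # MSE1imp
```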

Validation on the Roadmap Reference Epigenome
Comparing eDICE to the baselines at the assay level (Figure 2) and at the individual track level (Figures 3, 4, and 5) highlights that the reported improvements are consistent across the board, with the notable exception of four metrics in which ChromImpute outperforms eDICE (Fg MSE, Precision MACS, F1 MACS, and MCC MACS). For all four, the underlying issue is that factorization models such as eDICE and PREDICTD consistently underestimate the height of peaks, leading to reduced detection of enriched regions. As a consequence, eDICE is more conservative than ChromImpute in its predictions, a fact that is mirrored in the Recall MACS metric.
Imputation performance also varies across assays. For example, in the evaluation of eDICE, H3K9me3 and H3K27me3 displayed consistently suboptimal performance. Figures 7 and 8 highlight how this tendency is not exclusive to eDICE, but is shared by the baseline models.

Differential peak analysis
Differential peak analysis aims to detect meaningful differences between biological samples and typically involves multiple replicates in each group of interest. The inclusion of multiple samples enables the discrimination of natural variability within cell types from differentially enriched regions caused by the biological differences between tissues.
To analyse the use of imputed signals for detecting differential enrichment, we chose the H3K9ac assays for cell types E025 (Adipose Derived Mesenchymal Stem Cell Cultured Cells) and E052 (Muscle Satellite Cultured Cells) as a case study. These two epigenomic tracks are part of the hold-out test set, the two tissues contribute a comparable number of tracks to the training, validation, and test sets, and they have 4 and 3 measured replicates, respectively.
We tested the use of imputations for differential analysis with the DiffBind package [25]. DiffBind requires specifying the peak set and providing the aligned reads for treatment and control in a sample sheet. For the measurement group, we use the data from the Roadmap consortium, which includes the aligned reads for each replicate in .tagAlign format and the peaks called with MACS2 in .narrowPeak format. In addition, each replicate is associated with a cell-type-specific control track. The aligned .tagAlign files are converted to BAM format with BedTools [13].
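A conversion along these lines could be scripted as follows (file names are hypothetical; tagAlign is a BED-like format, so BedTools' bedtobam subcommand applies, given a chromosome-sizes file for the assembly, and it writes BAM to standard output):

```python
import subprocess

# Hypothetical file names; Roadmap data are aligned to hg19.
with open("E025_rep1.bam", "wb") as bam:
    subprocess.run(
        ["bedtools", "bedtobam",
         "-i", "E025_rep1.tagAlign",
         "-g", "hg19.chrom.sizes"],  # chromosome sizes for the assembly
        stdout=bam,
        check=True,
    )
```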
For the imputed tracks, we devised a procedure to simulate replicates from the predicted p-value signal. Such a procedure is an approximation that requires several assumptions, yet it suffices to provide a proof of concept for using imputation models in bioinformatics software pipelines. Nevertheless, for such applications, future imputation models would likely need to focus on modelling not only the average value of a signal but also the uncertainty involved.
To simulate replicates using p-value signals, we assume that the read counts at each genomic position follow a negative binomial distribution, a common model for sequencing data [14,15,16]. To draw samples from this distribution, we need to estimate two parameters: the mean and the dispersion.
We estimate the mean parameter by inverting the procedure used to generate the p-value tracks. The p-value signal results from the survival function of a background Poisson model, whose position-specific mean parameter is calculated from the tissue-specific control samples available in the Roadmap dataset. Given the parameters of the background distribution, the read counts for a given track are recovered through its inverse survival function (ISF).
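A minimal sketch of this inversion, assuming the −log10 p-value track and the bin-wise background means are available as arrays (variable names are ours):

```python
import numpy as np
from scipy.stats import poisson

def pvals_to_counts(neglog10_p, bg_lambda):
    """Invert a -log10 p-value track to read counts under the background
    Poisson model: the p-value at a bin is the Poisson survival function
    evaluated at the observed count, so the inverse survival function (ISF)
    of the bin-specific background distribution recovers a count estimate."""
    p = np.power(10.0, -np.asarray(neglog10_p))
    # poisson.isf returns the smallest count whose survival probability
    # falls at or below p under the bin-specific background mean
    return poisson.isf(p, np.asarray(bg_lambda))
```

These recovered counts then serve as the mean parameters of the negative binomial model described below.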
To infer the dispersion parameters for each cell type, we used the software package DESeq2 [15]. Specifically, DESeq2 fits a function that outputs a dispersion value for a given mean value. We selected tracks from the training set for H3K9ac whose tissues are sufficiently similar to the target tissues (E023 for E025, and E107-E108 for E052). These selected tracks are used to estimate the mean-dispersion functions, which are then applied to the previously estimated mean parameters to generate bin-wise dispersion parameters.
Given these bin-specific parameters derived from the imputed signals, we generate three samples per cell type by sampling from the negative binomial distribution.
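A minimal sketch of this sampling step, assuming the DESeq2 mean-dispersion trend has been exported as the coefficients of its parametric fit α(μ) = a1 + a0/μ (a0 and a1 are hypothetical placeholders for those fitted coefficients):

```python
import numpy as np

rng = np.random.default_rng(0)

def nb_replicates(mean_counts, a0, a1, n_reps=3):
    """Simulate replicates from a negative binomial with bin-wise means
    (estimated via the Poisson ISF above) and dispersions from a
    DESeq2-style parametric trend alpha(mu) = a1 + a0 / mu.
    The NB is parametrized so that var = mu + alpha * mu**2."""
    mu = np.clip(np.asarray(mean_counts, dtype=float), 1e-8, None)
    alpha = a1 + a0 / mu          # bin-wise dispersion
    n = 1.0 / alpha               # NB "size" parameter
    p = n / (n + mu)              # NB success probability
    return rng.negative_binomial(n, p, size=(n_reps, mu.size))
```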
The MACS [22] commands bdgcmp and bdgpeakcall were used to generate the peak set on the p-value track for each cell type, for both measured and simulated replicates. Next, the consensus peakset for all samples was estimated using DiffBind. ENCODE blacklisted regions [17] were removed from the consensus peakset. Aggregated counts for these consensus peaks were computed for both measured and simulated replicates and were read directly into the DiffBind DBA object with the dba.peakset function. Using a False Discovery Rate (FDR) threshold of 0.05, we determined the differentially enriched peaks for both measurements and imputations. An overview of the differential peak analysis pipeline can be found in Figure 23.

Interpreting the Model

Global Embeddings
eDICE learns global embeddings that capture the high-level relationships between assays and cell types, which we can display through a UMAP [18] projection.
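As an illustration, a 2D projection of such embeddings can be obtained with the umap-learn package (file names and plot details are hypothetical; this is not the exact plotting code behind Figure 10):

```python
import numpy as np
import umap
import matplotlib.pyplot as plt

# Hypothetical input: learned global assay embeddings, one row per assay
assay_embeddings = np.load("assay_embeddings.npy")  # shape (n_assays, d)

# Project to 2D; small embedding sets call for a small n_neighbors
proj = umap.UMAP(n_components=2, n_neighbors=5, random_state=0)
coords = proj.fit_transform(assay_embeddings)

plt.scatter(coords[:, 0], coords[:, 1])
plt.xlabel("UMAP 1")
plt.ylabel("UMAP 2")
plt.title("Global assay embeddings")
plt.show()
```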
Epigenetic modifications that are generally associated with the same functionality tend to cluster together in the learned representations (Figure 10). Assigning global roles to epigenetic modifications neglects much of the nuance of the regulatory mechanisms of the genome; nonetheless, we can consider some general patterns.
H3K27me3 is an antagonist of H3K27ac and, together with H3K9me3, is broadly associated with gene repression [23,24].
H3K36me3, H4K20me1, and H3K79me1 are linked to exons [25]; we consider these modifications under the transcription-associated label, together with H3K79me2, which is associated with active gene bodies and alternative splicing [26,27].
The partitioning of the epigenetic marks reflects the correlation patterns that emerge from the observed data. We calculated the average tracks for each assay across all cell types and performed agglomerative hierarchical clustering on these tracks. The results, summarised in Supplementary Figure 13, show that modifications with shared global functions tend to cluster together.
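A sketch of this check using SciPy's hierarchical clustering (input file names are hypothetical, and the choice of average linkage with correlation distance is one reasonable option, not necessarily the exact configuration used here):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Hypothetical input: one average -log10 p-value track per assay,
# averaged over all cell types (rows: assays, columns: genomic bins)
avg_tracks = np.load("avg_assay_tracks.npy")
assay_names = open("assay_names.txt").read().split()

# Agglomerative clustering with correlation distance between tracks
Z = linkage(avg_tracks, method="average", metric="correlation")
dendrogram(Z, labels=assay_names)
plt.ylabel("correlation distance")
plt.show()
```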
Likewise, eDICE is capable of learning the broad similarities between tissue types through the global cell embeddings represented in Figure 11.

Measures of Attention
Inspired by the analysis of attention mechanisms in protein language models [28], we define the portion of attention that each attention head h dedicates to a specific assay α in the set of observed assays A over a portion G of the genome as

$$P^{(h)}_{\alpha}(G) = \frac{1}{|G|\,|A|} \sum_{g \in G} \sum_{a \in A} w^{(g,h)}_{a\alpha},$$

where $w^{(g,h)}_{a\alpha}$ denotes the attention weight for the key assay α given the query assay a in the genomic bin g for attention head h. Within the formalism of key-value-query attention in Transformer models, this definition of the portion of attention corresponds to the average attention given to each key. Because the softmax used in scaled dot-product attention normalizes the attention weights over the keys, the portion of attention sums to one over the space of all assays for a single attention head, which makes it suitable to be expressed as a percentage of the head's total attention.
To evaluate how the state of the attention blocks changes across functional regions of the genome, we calculated the attention percentage while restricting the genomic bins to the aforementioned annotated regions, and evaluated the relative difference in the percentage of attention dedicated to each assay compared to the global pattern in the chromosome. We define this attention shift for a given region R of the genome as

$$\Delta^{(h)}_{\alpha}(R) = \frac{P^{(h)}_{\alpha}(R) - P^{(h)}_{\alpha}(\mathrm{chr})}{P^{(h)}_{\alpha}(\mathrm{chr})},$$

where R is any type of annotated region of the genome, e.g. promoters or enhancers, and $P^{(h)}_{\alpha}(\mathrm{chr})$ is the portion of attention computed over all bins in the chromosome.
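In code, both quantities reduce to simple averages over an attention-weight tensor; a minimal sketch, where the tensor layout is our assumption:

```python
import numpy as np

def attention_portion(weights, bin_mask=None):
    """Portion of attention per (head, key assay).

    weights: array of shape (n_bins, n_heads, n_assays, n_assays), where
    weights[g, h, a, k] is the softmax attention from query assay a to key
    assay k at genomic bin g. Averaging over bins and query assays leaves
    a (n_heads, n_assays) matrix that sums to 1 over the key axis."""
    w = weights if bin_mask is None else weights[bin_mask]
    return w.mean(axis=(0, 2))

def attention_shift(weights, region_mask):
    """Relative change of the attention portion inside an annotated region
    (boolean mask over bins) compared to the chromosome-wide pattern."""
    global_p = attention_portion(weights)
    region_p = attention_portion(weights, bin_mask=region_mask)
    return (region_p - global_p) / global_p
```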

Interpreting the Attention Weights
Attention models assign a weight to the relationship between elements in an input set or sequence, which is often considered a proxy for interpreting the underlying interactions between said elements. This area of research is mainly active within the application of Transformer models to Natural Language Processing (NLP) tasks [29,30], with a strong focus on the visualization of the attention weights (e.g. [31]).
Despite this enthusiasm, recent works have begun to highlight the limitations of attention as an explanation method, showing that in specific contexts attention weights are at best noisy predictors of the importance of input components for a model [32], or outright misleading and vulnerable to adversarial attacks [33]. Although these studies have some critical limitations, such as their focus on NLP models and their comparison to alternative measures of feature importance that are not necessarily valid as ground truth, we believe caution in the analysis of attention weights is warranted. Combined with the non-deterministic outcome of training attention models, these considerations lead us to treat the analysis presented here as a diagnostic exploration rather than a detailed claim about the biology captured by the model.
The results presented are derived from the same instance of eDICE used for the reference epigenome validation. Although the details vary due to the stochastic training of the attention block, we observed that the conclusions drawn tend to hold across training instances.
We specifically focus on the attention block used for the contextualization of the assay embeddings. The attention measures employed (detailed in Supplementary Section 4.2) are based on averages of attention weights, which paint a higher-level picture than comparisons of individual coefficients.
To obtain a general overview of the eDICE model, we calculated the percentage of attention given by each head to each assay for the entire chromosome 21. These results, displayed in Figure 12, highlight how at least two of the attention heads consistently dedicate a significant portion of attention to just a few histone modifications. These marks are among the core modifications extensively mapped across many cell types in the Roadmap dataset, i.e. H3K27ac, H3K36me3, H3K9me3, H3K4me1, and H3K4me3.
Interestingly, this phenomenon proved quite consistent over different experiments, with minor variations such as focusing on H3K27me3 rather than H3K27ac. Overall, this result suggests that the model relies extensively on marks for which a large amount of information is present during training.
To further explore the workings of the attention layer in eDICE, we gathered genomic annotations for chromosome 21 from the Ensembl [34] version 104 GRCh37 assembly for protein-coding gene bodies, enhancers, promoters, open chromatin, and transcription factor binding sites. We calculated how the percentage of attention shifts within these functional regions of the genome; the results are displayed in Figure 14.
Overall, these attention shifts recover known patterns from the literature yet present unexplained peculiarities, such as the significant shift within enhancer regions for H3K4me1, a known chromatin hallmark of these functional elements. Intuitively, one would expect the attention weight for H3K4me1 to grow within enhancers, given its biological importance, yet we observe the opposite, especially for attention head 2. This behaviour is not consistent across settings, however: H3K36me3, a histone modification enriched on gene bodies and associated with active transcription [35], shows a positive average attention shift within protein-coding genes.
Much remains to be understood about the workings of attention models. Therefore, we limit our analysis to the qualitative observation that the larger attention shifts within functional regions correspond to known epigenetic marks that characterize said regions. Examples of these patterns include the DNase shifts for open chromatin regions [36], the shifts in DNase and H3K4me1 for transcription factor binding sites [37], and the shifts within promoters for H3K4 methylation [38], H3K9ac, H3K27ac, and their antagonists H3K9me3 and H3K27me3 [19]. We include two high-level visual summaries of the attention shifts in Figures 15 and 16, in which the absolute values of the attention shifts were summed over the attention heads and the assays, respectively, and then normalized to [0, 1] for each annotated category. The resulting heatmaps provide an at-a-glance overview of which assays present the largest relative changes and which attention heads are most affected within functional regions of the genome.
Interestingly, the attention heads appear to specialize at least partially, with attention head 2 being the most affected in promoters, enhancers, and open chromatin regions, whereas attention head 3 presents the largest shifts for protein-coding gene bodies and transcription factor binding sites. The minimal shifts for attention head 4 also indicate that this head does not contribute significantly to the final prediction, a known phenomenon in multi-headed attention models, which makes it a candidate for pruning [39].

ENCODE imputation challenge model
An early version of our model was used as the basis of our third-placed entry in the 2019 ENCODE Imputation Challenge [40]. This model follows Avocado in adopting the factorized structure of Equation 1, but instead of directly parametrizing genomic bin representations $b_k$, it computes non-linear, low-dimensional transformations of the input signal at each bin:

$$\hat{y}_{ijk} = f_{\phi}\big(c_i, a_j, g_{\theta}(\mathrm{vec}(Y_k))\big),$$

where $\mathrm{vec}(Y_k)$ is a fixed-size vector whose entries are the signal values for all observed tracks at the kth bin, together with zeros for all unobserved tracks. In practice, the functions $g_{\theta}$ and $f_{\phi}$ are both implemented as two-layer MLPs, and $c_i$ and $a_j$ are learned cell and assay embeddings, equivalent to the 'global embeddings' $u_{c_i}$ and $u_{a_j}$ described in the Online Methods (Equations 4 and 6). This model can be seen as a kind of denoising autoencoder in which the decoder has a factorized structure, inherited from Avocado, allowing it to impute signal values for previously unobserved cell-assay combinations. In contrast, our model is fully factorized, replacing a single global transformation of the input signal with cell- and assay-dependent transformations, learned via self-attention (Equation 2).
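A minimal PyTorch sketch of this architecture (all layer sizes are illustrative, and the actual challenge entry may differ in depth, regularization, and training details):

```python
import torch
import torch.nn as nn

class ChallengeModel(nn.Module):
    """Sketch of the challenge model: a denoising autoencoder whose
    decoder keeps Avocado's factorized structure."""
    def __init__(self, n_cells, n_assays, d_embed=32, d_bin=64, d_hidden=256):
        super().__init__()
        self.cell_embed = nn.Embedding(n_cells, d_embed)    # c_i
        self.assay_embed = nn.Embedding(n_assays, d_embed)  # a_j
        n_tracks = n_cells * n_assays
        # g_theta: compresses the zero-filled signal vector vec(Y_k)
        self.g = nn.Sequential(
            nn.Linear(n_tracks, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_bin),
        )
        # f_phi: predicts the signal from (c_i, a_j, b_k)
        self.f = nn.Sequential(
            nn.Linear(2 * d_embed + d_bin, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, 1),
        )

    def forward(self, y_k, cell_idx, assay_idx):
        # y_k: (batch, n_tracks) observed signals at bin k, zeros elsewhere
        b_k = self.g(y_k)
        z = torch.cat([self.cell_embed(cell_idx),
                       self.assay_embed(assay_idx), b_k], dim=-1)
        return self.f(z).squeeze(-1)  # predicted signal y_hat_ijk
```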
The challenge model is used as a strong baseline in our experiments, since it has demonstrated competitive performance with state-of-the-art methods including Avocado in the blind-test setting of the ENCODE Imputation Challenge.